2023-12-02 01:41

This article is translated from: Convert list of dictionaries to a pandas DataFrame

I have a list of dictionaries like this:

    [{'points': 50, 'time': '5:00', 'year': 2010},
     {'points': 25, 'time': '6:00', 'month': "february"},
     {'points': 90, 'time': '9:00', 'month': 'january'},
     {'points_h1': 20, 'month': 'june'}]

And I want to turn this into a pandas DataFrame like this:

          month  points  points_h1  time  year
    0       NaN      50        NaN  5:00  2010
    1  february      25        NaN  6:00   NaN
    2   january      90        NaN  9:00   NaN
    3      june     NaN         20   NaN   NaN

Note: Order of the columns does not matter.

How can I turn the list of dictionaries into a pandas DataFrame as shown above?





    pd.DataFrame(d)

Answer #3



You can also use pd.DataFrame.from_dict(d):

    In [8]: d = [{'points': 50, 'time': '5:00', 'year': 2010},
       ...:      {'points': 25, 'time': '6:00', 'month': "february"},
       ...:      {'points': 90, 'time': '9:00', 'month': 'january'},
       ...:      {'points_h1': 20, 'month': 'june'}]

    In [12]: pd.DataFrame.from_dict(d)
    Out[12]:
          month  points  points_h1  time    year
    0       NaN    50.0        NaN  5:00  2010.0
    1  february    25.0        NaN  6:00     NaN
    2   january    90.0        NaN  9:00     NaN
    3      june     NaN       20.0   NaN     NaN

Answer #5

How do I convert a list of dictionaries to a pandas DataFrame?

The other answers are correct, but they say little about the advantages and limitations of these methods. The aim of this post is to show examples of these methods in different situations, discuss when to use them (and when not to), and suggest alternatives.

DataFrame(), DataFrame.from_records(), and .from_dict()

Depending on the structure and format of your data, there are situations where all three methods work, where some work better than others, or where some don't work at all.

Consider a very contrived example.

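    import numpy as np
    import pandas as pd

    np.random.seed(0)
    # 'records' is spelled out in full; the short form 'r' is no longer accepted by recent pandas.
    data = pd.DataFrame(
        np.random.choice(10, (3, 4)), columns=list('ABCD')).to_dict('records')

    print(data)
    [{'A': 5, 'B': 0, 'C': 3, 'D': 3},
     {'A': 7, 'B': 9, 'C': 3, 'D': 5},
     {'A': 2, 'B': 4, 'C': 7, 'D': 6}]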

This list consists of "records" with every key present. This is the simplest case you could encounter.

    # The following methods all produce the same output.
    pd.DataFrame(data)
    pd.DataFrame.from_dict(data)
    pd.DataFrame.from_records(data)

       A  B  C  D
    0  5  0  3  3
    1  7  9  3  5
    2  2  4  7  6

Word on Dictionary Orientations: orient='index' / 'columns'

Before continuing, it is important to distinguish between the different types of dictionary orientations and how pandas supports them. There are two primary types: "columns" and "index".

orient='columns'

Dictionaries with the "columns" orientation will have their keys correspond to columns in the equivalent DataFrame.

For example, data above is in the "columns" orientation.

    data_c = [
        {'A': 5, 'B': 0, 'C': 3, 'D': 3},
        {'A': 7, 'B': 9, 'C': 3, 'D': 5},
        {'A': 2, 'B': 4, 'C': 7, 'D': 6}]

    pd.DataFrame.from_dict(data_c, orient='columns')

       A  B  C  D
    0  5  0  3  3
    1  7  9  3  5
    2  2  4  7  6

Note: If you are using pd.DataFrame.from_records, the orientation is assumed to be "columns" (you cannot specify otherwise), and the dictionaries will be loaded accordingly.
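As a quick check of the note above, here is a minimal sketch, assuming a recent pandas release and a small hypothetical `records` list (not part of the original answer). It suggests that `from_records` has no `orient` keyword at all, so passing one is rejected with a `TypeError` instead of changing how the dictionaries are read.

```python
import pandas as pd

# Hypothetical sample data in the same list-of-records shape as `data` above.
records = [{'A': 5, 'B': 0}, {'A': 7, 'B': 9}]

# from_records always treats the dictionary keys as column names.
df = pd.DataFrame.from_records(records)

# from_records does not define an `orient` keyword, so this call is rejected.
try:
    pd.DataFrame.from_records(records, orient='index')
    orient_accepted = True
except TypeError:
    orient_accepted = False

print(list(df.columns), orient_accepted)
```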

orient='index'

With this orientation, keys are assumed to correspond to index values. This kind of data is best suited for pd.DataFrame.from_dict.

    data_i = {
        0: {'A': 5, 'B': 0, 'C': 3, 'D': 3},
        1: {'A': 7, 'B': 9, 'C': 3, 'D': 5},
        2: {'A': 2, 'B': 4, 'C': 7, 'D': 6}}

    pd.DataFrame.from_dict(data_i, orient='index')

       A  B  C  D
    0  5  0  3  3
    1  7  9  3  5
    2  2  4  7  6

This case is not considered in the OP, but is still useful to know.
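The outer keys do not have to be integers. The sketch below uses a hypothetical `data_labeled` dictionary (the names are illustrative only) to show the outer keys turning into row labels when `orient='index'` is used.

```python
import pandas as pd

# Hypothetical data keyed by row label instead of by position.
data_labeled = {
    'first': {'A': 5, 'B': 0},
    'second': {'A': 7, 'B': 9},
}

# With orient='index', the outer keys become the row labels
# and the inner keys become the column names.
df = pd.DataFrame.from_dict(data_labeled, orient='index')
print(df)
```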

Setting Custom Index

If you need a custom index on the resultant DataFrame, you can set it using the index=... argument.

    pd.DataFrame(data, index=['a', 'b', 'c'])
    # pd.DataFrame.from_records(data, index=['a', 'b', 'c'])

       A  B  C  D
    a  5  0  3  3
    b  7  9  3  5
    c  2  4  7  6

The index=... argument is not supported by pd.DataFrame.from_dict.
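If you still want to go through `from_dict`, one possible workaround is to build the frame first and assign the index afterwards. This is a sketch under that assumption, with a hypothetical `records` list; it also shows the `TypeError` raised when `index` is passed to `from_dict`.

```python
import pandas as pd

# Hypothetical sample records.
records = [{'A': 5, 'B': 0}, {'A': 7, 'B': 9}, {'A': 2, 'B': 4}]

# from_dict has no `index` keyword, so this call is rejected.
try:
    pd.DataFrame.from_dict(records, index=['a', 'b', 'c'])
    index_accepted = True
except TypeError:
    index_accepted = False

# Workaround: construct first, then assign the labels.
df = pd.DataFrame.from_dict(records)
df.index = ['a', 'b', 'c']
print(df)
```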

Dealing with Missing Keys/Columns

All methods work out of the box when handling dictionaries with missing keys/column values. For example,

    data2 = [
        {'A': 5, 'C': 3, 'D': 3},
        {'A': 7, 'B': 9, 'F': 5},
        {'B': 4, 'C': 7, 'E': 6}]

    # The methods below all produce the same output.
    pd.DataFrame(data2)
    pd.DataFrame.from_dict(data2)
    pd.DataFrame.from_records(data2)

         A    B    C    D    E    F
    0  5.0  NaN  3.0  3.0  NaN  NaN
    1  7.0  9.0  NaN  NaN  NaN  5.0
    2  NaN  4.0  7.0  NaN  6.0  NaN

Reading Subset of Columns

"What if I don't want to read in every single column?" You can easily specify this using the columns=... parameter.

For example, from the example dictionary of data2 above, if you wanted to read only columns "A", "D", and "F", you can do so by passing a list:

    pd.DataFrame(data2, columns=['A', 'D', 'F'])
    # pd.DataFrame.from_records(data2, columns=['A', 'D', 'F'])

         A    D    F
    0  5.0  3.0  NaN
    1  7.0  NaN  5.0
    2  NaN  NaN  NaN
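One detail worth knowing about `columns=...`: a name that appears in none of the dictionaries is not an error. The sketch below assumes a hypothetical `records` list and a made-up column name `'Z'`, and shows that the requested column is still created and filled with NaN, while keys that were not requested are dropped.

```python
import pandas as pd

# Hypothetical sample records; 'Z' is deliberately absent from all of them.
records = [{'A': 5, 'D': 3}, {'A': 7, 'F': 5}]

# Only the requested columns are kept; unknown names become all-NaN columns.
df = pd.DataFrame(records, columns=['A', 'Z'])
print(df)
```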

The columns=... parameter is not supported by pd.DataFrame.from_dict with the default orient "columns".

    pd.DataFrame.from_dict(data2, orient='columns', columns=['A', 'B'])

    ValueError: cannot use columns parameter with orient='columns'

Reading Subset of Rows

Not supported by any of these methods directly. You will have to iterate over your data and delete the unwanted entries in place, working in reverse so that a deletion never shifts the positions that are still to be checked. Note that this modifies data2 itself. For example, to extract only the 0th and 2nd rows from data2 above, you can use:

    rows_to_select = {0, 2}
    # Walk the list backwards so that deleting an item does not shift
    # the indices that have not been visited yet.
    for i in reversed(range(len(data2))):
        if i not in rows_to_select:
            del data2[i]

    pd.DataFrame(data2)
    # pd.DataFrame.from_dict(data2)
    # pd.DataFrame.from_records(data2)

         A    B  C    D    E
    0  5.0  NaN  3  3.0  NaN
    1  NaN  4.0  7  NaN  6.0

The Panacea: json_normalize for Nested Data

A robust alternative to the methods outlined above is the json_normalize function, which works with lists of dictionaries (records) and can also handle nested dictionaries.

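    # pd.json_normalize is the top-level name since pandas 1.0;
    # older releases expose it as pd.io.json.json_normalize.
    pd.json_normalize(data)

       A  B  C  D
    0  5  0  3  3
    1  7  9  3  5
    2  2  4  7  6

    # data2 now holds only the two rows kept by the in-place deletion above.
    pd.json_normalize(data2)

         A    B  C    D    E
    0  5.0  NaN  3  3.0  NaN
    1  NaN  4.0  7  NaN  6.0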

Again, keep in mind that the data passed to json_normalize needs to be in the list-of-dictionaries (records) format.
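To illustrate what "handling nested dictionaries" means in the simplest case, here is a small sketch. It assumes pandas 1.0 or later (where the function is available as `pd.json_normalize`) and a hypothetical `records` list; nested keys are flattened into dotted column names.

```python
import pandas as pd

# Hypothetical records in which 'info' is itself a dictionary.
records = [
    {'state': 'Florida', 'info': {'governor': 'Rick Scott'}},
    {'state': 'Ohio', 'info': {'governor': 'John Kasich'}},
]

# Nested keys are joined with '.' to form the flattened column names.
flat = pd.json_normalize(records)
print(flat)
```

By contrast, `pd.DataFrame(records)` would keep each `info` dictionary as a single object in one column.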

As mentioned, json_normalize can also handle nested dictionaries. Here's an example taken from the documentation.

    data_nested = [
        {'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}],
         'info': {'governor': 'Rick Scott'},
         'shortname': 'FL',
         'state': 'Florida'},
        {'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}],
         'info': {'governor': 'John Kasich'},
         'shortname': 'OH',
         'state': 'Ohio'}
    ]

    # pd.json_normalize is the top-level name since pandas 1.0;
    # older releases expose it as pd.io.json.json_normalize.
    pd.json_normalize(data_nested,
                      record_path='counties',
                      meta=['state', 'shortname', ['info', 'governor']])

             name  population    state shortname info.governor
    0        Dade       12345  Florida        FL    Rick Scott
    1     Broward       40000  Florida        FL    Rick Scott
    2  Palm Beach       60000  Florida        FL    Rick Scott
    3      Summit        1234     Ohio        OH   John Kasich
    4    Cuyahoga        1337     Ohio        OH   John Kasich

For more information on the meta and record_path arguments, check out the documentation.
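As one further illustration of these arguments, the sketch below combines `record_path` and `meta` with `record_prefix`, which prepends a string to the columns that come from the nested records. It assumes pandas 1.0 or later and a trimmed-down, hypothetical `states` list; the prefix `'county.'` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical, trimmed-down version of the nested example.
states = [
    {'state': 'Florida',
     'counties': [{'name': 'Dade', 'population': 12345},
                  {'name': 'Broward', 'population': 40000}]},
    {'state': 'Ohio',
     'counties': [{'name': 'Summit', 'population': 1234}]},
]

# One output row per county; 'state' is repeated for each of its counties.
df = pd.json_normalize(
    states,
    record_path='counties',
    meta=['state'],
    record_prefix='county.',
)
print(df)
```

A prefix like this can help when a record field and a meta field would otherwise share the same name.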

Summarising

Here's a table of all the methods discussed above, along with supported features/functionality.


* Use orient='columns' and then transpose to get the same effect as orient='index'.
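The footnote can be checked with a short sketch. It assumes a hypothetical `data_by_index` dictionary keyed by row position, loads it once with `orient='index'` and once with `orient='columns'` followed by a transpose, and compares the labels and values of the two results.

```python
import pandas as pd

# Hypothetical data keyed by index value.
data_by_index = {
    0: {'A': 5, 'B': 0},
    1: {'A': 7, 'B': 9},
}

# Outer keys read directly as row labels.
via_index = pd.DataFrame.from_dict(data_by_index, orient='index')

# Outer keys read as column names, then rows and columns are swapped.
via_transpose = pd.DataFrame.from_dict(data_by_index, orient='columns').T

same_labels = (list(via_index.index) == list(via_transpose.index)
               and list(via_index.columns) == list(via_transpose.columns))
same_values = via_index.values.tolist() == via_transpose.values.tolist()
print(same_labels, same_values)
```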


I know a few people will come across this and find that nothing here helps. The easiest way I have found to do it is like this:

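    import pandas as pd

    # Start from the first dictionary, then add the rest one row at a time.
    df = pd.DataFrame(dict_list[0], index=[0])
    for i in range(1, len(dict_list)):  # the original upper bound skipped the last dictionary
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
        df = pd.concat([df, pd.DataFrame([dict_list[i]])], ignore_index=True)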

Hope this helps someone!




